Crash Course in Pandas

Pandas is a very popular Python package that provides data structures and data analysis tools. It includes tools for reading and writing various data formats, processing data sets in an efficient DataFrame object, and the ability to reshape, filter, index, and subset data easily.

This notebook is a quick introduction to some of the features of the Pandas package.

Reading Data into DataFrames

The primary data structure in the Pandas package is the DataFrame. A DataFrame is a table-like object that contains named columns of data. Pandas provides readers for many different file types including CSV, Excel, SAS, HTML, etc. Since CSV is such a common format, let's look at reading a CSV file into a Pandas DataFrame.

First, we need to import the Pandas package. A very common convention is to import Pandas using a shortcut named of 'pd'. This is done with the following line.



In [1]:

    
import pandas as pd

The function that reads CSV files into DataFrames is called read_csv. In the simplest form, you just supply it with a filename or URL. We have the SAS cars data set stored in github that we can point to directly. We'll use the head method to display just the first few records.



In [2]:

    
df = pd.read_csv('https://raw.githubusercontent.com/'
                 'sassoftware/sas-viya-programming/master/data/cars.csv')
df.head()









    Out[2]:






  
    
      
      Make
      Model
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
  
  
    
      0
      Acura
      MDX
      SUV
      Asia
      All
      36945.0
      33337.0
      3.5
      6.0
      265.0
      17.0
      23.0
      4451.0
      106.0
      189.0
    
    
      1
      Acura
      RSX Type S 2dr
      Sedan
      Asia
      Front
      23820.0
      21761.0
      2.0
      4.0
      200.0
      24.0
      31.0
      2778.0
      101.0
      172.0
    
    
      2
      Acura
      TSX 4dr
      Sedan
      Asia
      Front
      26990.0
      24647.0
      2.4
      4.0
      200.0
      22.0
      29.0
      3230.0
      105.0
      183.0
    
    
      3
      Acura
      TL 4dr
      Sedan
      Asia
      Front
      33195.0
      30299.0
      3.2
      6.0
      270.0
      20.0
      28.0
      3575.0
      108.0
      186.0
    
    
      4
      Acura
      3.5 RL 4dr
      Sedan
      Asia
      Front
      43755.0
      39014.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3880.0
      115.0
      197.0

There are many more Pandas data readers that you read about on the Pandas web site.

Displaying Information about DataFrames

We have displayed the DataFrame above. We can get more information about the DataFrame using various properties and methods.

The list of column names can be displayed using the columns property.



In [3]:

    
df.columns









    Out[3]:





Index(['Make', 'Model', 'Type', 'Origin', 'DriveTrain', 'MSRP', 'Invoice',
       'EngineSize', 'Cylinders', 'Horsepower', 'MPG_City', 'MPG_Highway',
       'Weight', 'Wheelbase', 'Length'],
      dtype='object')

The data types of the columns can be displayed using the dtypes property.



In [4]:

    
df.dtypes









    Out[4]:





Make            object
Model           object
Type            object
Origin          object
DriveTrain      object
MSRP           float64
Invoice        float64
EngineSize     float64
Cylinders      float64
Horsepower     float64
MPG_City       float64
MPG_Highway    float64
Weight         float64
Wheelbase      float64
Length         float64
dtype: object

For general information about the DataFrame as a whole, you can use the info method.



In [5]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 428 entries, 0 to 427
Data columns (total 15 columns):
Make           428 non-null object
Model          428 non-null object
Type           428 non-null object
Origin         428 non-null object
DriveTrain     428 non-null object
MSRP           428 non-null float64
Invoice        428 non-null float64
EngineSize     428 non-null float64
Cylinders      426 non-null float64
Horsepower     428 non-null float64
MPG_City       428 non-null float64
MPG_Highway    428 non-null float64
Weight         428 non-null float64
Wheelbase      428 non-null float64
Length         428 non-null float64
dtypes: float64(10), object(5)
memory usage: 50.2+ KB

Now that we know more about the columns and their data types, we can move on to subsetting DataFrames into other DataFrames or columns.

Subsetting DataFrames

You can select columns from a DataFrame to create a DataFrame with fewer columns, or you can select a single column of data into a Series. The syntax used for this operation is Python's indexing syntax [...].

If you index into a DataFrame using just the name of a column, you will get a Series back. If you index into a DataFrame using a list of names, you will get a DataFrame back.



In [6]:

    
subdf = df[['Make', 'Model', 'Horsepower']]
subdf.head()









    Out[6]:






  
    
      
      Make
      Model
      Horsepower
    
  
  
    
      0
      Acura
      MDX
      265.0
    
    
      1
      Acura
      RSX Type S 2dr
      200.0
    
    
      2
      Acura
      TSX 4dr
      200.0
    
    
      3
      Acura
      TL 4dr
      270.0
    
    
      4
      Acura
      3.5 RL 4dr
      225.0



In [7]:

    
model = df['Model']
model.head()









    Out[7]:





0               MDX
1    RSX Type S 2dr
2           TSX 4dr
3            TL 4dr
4        3.5 RL 4dr
Name: Model, dtype: object

In addition to selecting columns of data, DataFrames have more advanced ways of selecting data. We'll look at that in the next section.

Row Indexing

Pandas DataFrames have an index associated both with rows and columns. We have seen the column index in the columns property above. We have also indexed into that collection using the column names. You can also index rows by position and label using the iloc and loc properties, respectively.

Currently, our cars DataFrame only has numeric row indexes, it doesn't have labels, so we'll start with iloc.



In [8]:

    
df.iloc[3:7]









    Out[8]:






  
    
      
      Make
      Model
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
  
  
    
      3
      Acura
      TL 4dr
      Sedan
      Asia
      Front
      33195.0
      30299.0
      3.2
      6.0
      270.0
      20.0
      28.0
      3575.0
      108.0
      186.0
    
    
      4
      Acura
      3.5 RL 4dr
      Sedan
      Asia
      Front
      43755.0
      39014.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3880.0
      115.0
      197.0
    
    
      5
      Acura
      3.5 RL w/Navigation 4dr
      Sedan
      Asia
      Front
      46100.0
      41100.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3893.0
      115.0
      197.0
    
    
      6
      Acura
      NSX coupe 2dr manual S
      Sports
      Asia
      Rear
      89765.0
      79978.0
      3.2
      6.0
      290.0
      17.0
      24.0
      3153.0
      100.0
      174.0

As you can see above, we selected the row indexs from 3 to 7. In Python, numeric slices such as this include the first index, but do not include the last index. That is why you see row indexes 3, 4, 5, and 6 only in the result.

You can also select rows using a list of row indexes.



In [9]:

    
df.iloc[[4, 5, 10]]









    Out[9]:






  
    
      
      Make
      Model
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
  
  
    
      4
      Acura
      3.5 RL 4dr
      Sedan
      Asia
      Front
      43755.0
      39014.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3880.0
      115.0
      197.0
    
    
      5
      Acura
      3.5 RL w/Navigation 4dr
      Sedan
      Asia
      Front
      46100.0
      41100.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3893.0
      115.0
      197.0
    
    
      10
      Audi
      A4 3.0 Quattro 4dr manual
      Sedan
      Europe
      All
      33430.0
      30366.0
      3.0
      6.0
      220.0
      17.0
      26.0
      3583.0
      104.0
      179.0

Arguably a more useful way of indexing Pandas DataFrames is using row labels. We can set a column as a row index using the set_index method. Then we can use those labels in the loc property for easy data selection.



In [10]:

    
df = df.set_index('Model')



In [11]:

    
df.head()









    Out[11]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      MDX
      Acura
      SUV
      Asia
      All
      36945.0
      33337.0
      3.5
      6.0
      265.0
      17.0
      23.0
      4451.0
      106.0
      189.0
    
    
      RSX Type S 2dr
      Acura
      Sedan
      Asia
      Front
      23820.0
      21761.0
      2.0
      4.0
      200.0
      24.0
      31.0
      2778.0
      101.0
      172.0
    
    
      TSX 4dr
      Acura
      Sedan
      Asia
      Front
      26990.0
      24647.0
      2.4
      4.0
      200.0
      22.0
      29.0
      3230.0
      105.0
      183.0
    
    
      TL 4dr
      Acura
      Sedan
      Asia
      Front
      33195.0
      30299.0
      3.2
      6.0
      270.0
      20.0
      28.0
      3575.0
      108.0
      186.0
    
    
      3.5 RL 4dr
      Acura
      Sedan
      Asia
      Front
      43755.0
      39014.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3880.0
      115.0
      197.0

You'll see in the output above that the Model column is now displayed as a row index. We can use the values in that index in the loc property. This includes both single values, lists of values, or slices.



In [12]:

    
df.loc['MDX']









    Out[12]:





Make           Acura
Type             SUV
Origin          Asia
DriveTrain       All
MSRP           36945
Invoice        33337
EngineSize       3.5
Cylinders          6
Horsepower       265
MPG_City          17
MPG_Highway       23
Weight          4451
Wheelbase        106
Length           189
Name: MDX, dtype: object



In [13]:

    
df.loc[['TSX 4dr', 'TL 4dr']]









    Out[13]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      TSX 4dr
      Acura
      Sedan
      Asia
      Front
      26990.0
      24647.0
      2.4
      4.0
      200.0
      22.0
      29.0
      3230.0
      105.0
      183.0
    
    
      TL 4dr
      Acura
      Sedan
      Asia
      Front
      33195.0
      30299.0
      3.2
      6.0
      270.0
      20.0
      28.0
      3575.0
      108.0
      186.0



In [14]:

    
df.loc['MDX':'TL 4dr']









    Out[14]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      MDX
      Acura
      SUV
      Asia
      All
      36945.0
      33337.0
      3.5
      6.0
      265.0
      17.0
      23.0
      4451.0
      106.0
      189.0
    
    
      RSX Type S 2dr
      Acura
      Sedan
      Asia
      Front
      23820.0
      21761.0
      2.0
      4.0
      200.0
      24.0
      31.0
      2778.0
      101.0
      172.0
    
    
      TSX 4dr
      Acura
      Sedan
      Asia
      Front
      26990.0
      24647.0
      2.4
      4.0
      200.0
      22.0
      29.0
      3230.0
      105.0
      183.0
    
    
      TL 4dr
      Acura
      Sedan
      Asia
      Front
      33195.0
      30299.0
      3.2
      6.0
      270.0
      20.0
      28.0
      3575.0
      108.0
      186.0

Note that when using slicing on labels, the endpoint for the slice is included. This is different than for row indexes.

Indexing Both Rows and Columns

The iloc and loc properties can be used to subset both rows and columns simultaneously. However, the iloc property must only contain positional row and column index information, and loc must only contain row and column labels.



In [15]:

    
df.iloc[1:5, 3:6]









    Out[15]:






  
    
      
      DriveTrain
      MSRP
      Invoice
    
    
      Model
      
      
      
    
  
  
    
      RSX Type S 2dr
      Front
      23820.0
      21761.0
    
    
      TSX 4dr
      Front
      26990.0
      24647.0
    
    
      TL 4dr
      Front
      33195.0
      30299.0
    
    
      3.5 RL 4dr
      Front
      43755.0
      39014.0



In [16]:

    
df.loc['MDX':'TL 4dr', 'MSRP':'Horsepower']









    Out[16]:






  
    
      
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
    
    
      Model
      
      
      
      
      
    
  
  
    
      MDX
      36945.0
      33337.0
      3.5
      6.0
      265.0
    
    
      RSX Type S 2dr
      23820.0
      21761.0
      2.0
      4.0
      200.0
    
    
      TSX 4dr
      26990.0
      24647.0
      2.4
      4.0
      200.0
    
    
      TL 4dr
      33195.0
      30299.0
      3.2
      6.0
      270.0

If you want to mix positional and label information between rows and columns, you need to use the ix property.



In [17]:

    
df.ix[2:6, 'MSRP':'Cylinders']









    Out[17]:






  
    
      
      MSRP
      Invoice
      EngineSize
      Cylinders
    
    
      Model
      
      
      
      
    
  
  
    
      TSX 4dr
      26990.0
      24647.0
      2.4
      4.0
    
    
      TL 4dr
      33195.0
      30299.0
      3.2
      6.0
    
    
      3.5 RL 4dr
      43755.0
      39014.0
      3.5
      6.0
    
    
      3.5 RL w/Navigation 4dr
      46100.0
      41100.0
      3.5
      6.0

Of course, the ix property also supports single values, lists of values, and slices for each component.



In [18]:

    
df.ix[[1,2,5], ['Make', 'MSRP', 'Horsepower']]









    Out[18]:






  
    
      
      Make
      MSRP
      Horsepower
    
    
      Model
      
      
      
    
  
  
    
      RSX Type S 2dr
      Acura
      23820.0
      200.0
    
    
      TSX 4dr
      Acura
      26990.0
      200.0
    
    
      3.5 RL w/Navigation 4dr
      Acura
      46100.0
      225.0

Boolean Indexing

A more dynamic way of indexing DataFrames is to use boolean indexing. Rather than specifying explicit values for the index, we will use an expression to indicate which rows we want to select. The expression creates a boolean Series which indicates the rows to keep. Here is an example of a boolean series created by comparing the MSRP column to 40000.



In [19]:

    
df['MSRP'] > 40000









    Out[19]:





Model
MDX                                 False
RSX Type S 2dr                      False
TSX 4dr                             False
TL 4dr                              False
3.5 RL 4dr                           True
3.5 RL w/Navigation 4dr              True
NSX coupe 2dr manual S               True
A4 1.8T 4dr                         False
A41.8T convertible 2dr              False
A4 3.0 4dr                          False
A4 3.0 Quattro 4dr manual           False
A4 3.0 Quattro 4dr auto             False
A6 3.0 4dr                          False
A6 3.0 Quattro 4dr                  False
A4 3.0 convertible 2dr               True
A4 3.0 Quattro convertible 2dr       True
A6 2.7 Turbo Quattro 4dr             True
A6 4.2 Quattro 4dr                   True
A8 L Quattro 4dr                     True
S4 Quattro 4dr                       True
RS 6 4dr                             True
TT 1.8 convertible 2dr (coupe)      False
TT 1.8 Quattro 2dr (convertible)    False
TT 3.2 coupe 2dr (convertible)       True
A6 3.0 Avant Quattro                 True
S4 Avant Quattro                     True
X3 3.0i                             False
X5 4.4i                              True
325i 4dr                            False
325Ci 2dr                           False
                                    ...  
Tundra Regular Cab V6               False
Tundra Access Cab V6 SR5            False
Matrix XR                           False
Touareg V6                          False
Golf GLS 4dr                        False
GTI 1.8T 2dr hatch                  False
Jetta GLS TDI 4dr                   False
New Beetle GLS 1.8T 2dr             False
Jetta GLI VR6 4dr                   False
New Beetle GLS convertible 2dr      False
Passat GLS 4dr                      False
Passat GLX V6 4MOTION 4dr           False
Passat W8 4MOTION 4dr               False
Phaeton 4dr                          True
Phaeton W12 4dr                      True
Jetta GL                            False
Passat GLS 1.8T                     False
Passat W8                            True
XC90 T6                              True
S40 4dr                             False
S60 2.5 4dr                         False
S60 T5 4dr                          False
S60 R 4dr                           False
S80 2.9 4dr                         False
S80 2.5T 4dr                        False
C70 LPT convertible 2dr              True
C70 HPT convertible 2dr              True
S80 T6 4dr                           True
V40                                 False
XC70                                False
Name: MSRP, dtype: bool

As you can see, the values of the Series are all True or False. If we use that condition as an index value of a DataFrame, we'll get a new DataFrame containing only the rows where the condition was true.



In [20]:

    
df[df['MSRP'] > 40000].head()









    Out[20]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      3.5 RL 4dr
      Acura
      Sedan
      Asia
      Front
      43755.0
      39014.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3880.0
      115.0
      197.0
    
    
      3.5 RL w/Navigation 4dr
      Acura
      Sedan
      Asia
      Front
      46100.0
      41100.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3893.0
      115.0
      197.0
    
    
      NSX coupe 2dr manual S
      Acura
      Sports
      Asia
      Rear
      89765.0
      79978.0
      3.2
      6.0
      290.0
      17.0
      24.0
      3153.0
      100.0
      174.0
    
    
      A4 3.0 convertible 2dr
      Audi
      Sedan
      Europe
      Front
      42490.0
      38325.0
      3.0
      6.0
      220.0
      20.0
      27.0
      3814.0
      105.0
      180.0
    
    
      A4 3.0 Quattro convertible 2dr
      Audi
      Sedan
      Europe
      All
      44240.0
      40075.0
      3.0
      6.0
      220.0
      18.0
      25.0
      4013.0
      105.0
      180.0

You can combine conditions using & for "and" and | for "or". For example, if we wanted cars that had an MSRP greater than $40,000, but also had 8 or more Cylinders, we could do the following. Note that due to the order of precedence of comparison and combination operatiors, you need to surround your comparions with parentheses.



In [21]:

    
df[(df['MSRP'] > 40000) & (df['Cylinders'] > 8)].head()









    Out[21]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Viper SRT-10 convertible 2dr
      Dodge
      Sports
      USA
      Rear
      81795.0
      74451.0
      8.3
      10.0
      500.0
      12.0
      20.0
      3410.0
      99.0
      176.0
    
    
      Excursion 6.8 XLT
      Ford
      SUV
      USA
      All
      41475.0
      36494.0
      6.8
      10.0
      310.0
      10.0
      13.0
      7190.0
      137.0
      227.0
    
    
      CL600 2dr
      Mercedes-Benz
      Sedan
      Europe
      Rear
      128420.0
      119600.0
      5.5
      12.0
      493.0
      13.0
      19.0
      4473.0
      114.0
      196.0
    
    
      SL600 convertible 2dr
      Mercedes-Benz
      Sports
      Europe
      Rear
      126670.0
      117854.0
      5.5
      12.0
      493.0
      13.0
      19.0
      4429.0
      101.0
      179.0
    
    
      Phaeton W12 4dr
      Volkswagen
      Sedan
      Europe
      Front
      75000.0
      69130.0
      6.0
      12.0
      420.0
      12.0
      19.0
      5399.0
      118.0
      204.0

Sorting

Sorting in Pandas DataFrames can be done according to the index or column values. The methods used to sort a DataFrame are sort_index for sorting by the index and sort_values for sorting by the data values.

Since we still have our index set as Model, let's sort by that first.



In [22]:

    
df.sort_index().head()









    Out[22]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      3.5 RL 4dr
      Acura
      Sedan
      Asia
      Front
      43755.0
      39014.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3880.0
      115.0
      197.0
    
    
      3.5 RL w/Navigation 4dr
      Acura
      Sedan
      Asia
      Front
      46100.0
      41100.0
      3.5
      6.0
      225.0
      18.0
      24.0
      3893.0
      115.0
      197.0
    
    
      300M 4dr
      Chrysler
      Sedan
      USA
      Front
      29865.0
      27797.0
      3.5
      6.0
      250.0
      18.0
      27.0
      3581.0
      113.0
      198.0
    
    
      300M Special Edition 4dr
      Chrysler
      Sedan
      USA
      Front
      33295.0
      30884.0
      3.5
      6.0
      255.0
      18.0
      27.0
      3650.0
      113.0
      198.0
    
    
      325Ci 2dr
      BMW
      Sedan
      Europe
      Rear
      30795.0
      28245.0
      2.5
      6.0
      184.0
      20.0
      29.0
      3197.0
      107.0
      177.0

Sorting by values is done by specify the columns that you want to sort by.



In [23]:

    
df.sort_values(['MSRP', 'Horsepower']).head()









    Out[23]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Rio 4dr manual
      Kia
      Sedan
      Asia
      Front
      10280.0
      9875.0
      1.6
      4.0
      104.0
      26.0
      33.0
      2403.0
      95.0
      167.0
    
    
      Accent 2dr hatch
      Hyundai
      Sedan
      Asia
      Front
      10539.0
      10107.0
      1.6
      4.0
      103.0
      29.0
      33.0
      2255.0
      96.0
      167.0
    
    
      Echo 2dr manual
      Toyota
      Sedan
      Asia
      Front
      10760.0
      10144.0
      1.5
      4.0
      108.0
      35.0
      43.0
      2035.0
      93.0
      163.0
    
    
      Ion1 4dr
      Saturn
      Sedan
      USA
      Front
      10995.0
      10319.0
      2.2
      4.0
      140.0
      26.0
      35.0
      2692.0
      103.0
      185.0
    
    
      Rio 4dr auto
      Kia
      Sedan
      Asia
      Front
      11155.0
      10705.0
      1.6
      4.0
      104.0
      25.0
      32.0
      2458.0
      95.0
      167.0

You can also use the ascending= option to specify the order in which the columns should be sorted.



In [24]:

    
df.sort_values(['MSRP', 'Horsepower'], ascending=[False, True]).head()









    Out[24]:






  
    
      
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Model
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      911 GT2 2dr
      Porsche
      Sports
      Europe
      Rear
      192465.0
      173560.0
      3.6
      6.0
      477.0
      17.0
      24.0
      3131.0
      93.0
      175.0
    
    
      CL600 2dr
      Mercedes-Benz
      Sedan
      Europe
      Rear
      128420.0
      119600.0
      5.5
      12.0
      493.0
      13.0
      19.0
      4473.0
      114.0
      196.0
    
    
      SL600 convertible 2dr
      Mercedes-Benz
      Sports
      Europe
      Rear
      126670.0
      117854.0
      5.5
      12.0
      493.0
      13.0
      19.0
      4429.0
      101.0
      179.0
    
    
      SL55 AMG 2dr
      Mercedes-Benz
      Sports
      Europe
      Rear
      121770.0
      113388.0
      5.5
      8.0
      493.0
      14.0
      21.0
      4235.0
      101.0
      179.0
    
    
      CL500 2dr
      Mercedes-Benz
      Sedan
      Europe
      Rear
      94820.0
      88324.0
      5.0
      8.0
      302.0
      16.0
      24.0
      4085.0
      114.0
      196.0

Keep in mind that the sort_index and sort_values methods return a new DataFrame with the sorting applied. If you want to sort a DataFrame in place, you need to specify the inplace=True option.

Let's reset the index before moving on to the next sections.



In [25]:

    
df.reset_index(inplace=True)

Doing Simple Statistics

Pandas DataFrames have many builtin methods for doing simple statistics. Probably the most common one is describe. It computes various basic statistics for the entire DataFrame.



In [26]:

    
df.describe()









    



/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    Out[26]:






  
    
      
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
  
  
    
      count
      428.000000
      428.000000
      428.000000
      426.000000
      428.000000
      428.000000
      428.000000
      428.000000
      428.000000
      428.000000
    
    
      mean
      32774.855140
      30014.700935
      3.196729
      5.807512
      215.885514
      20.060748
      26.843458
      3577.953271
      108.154206
      186.362150
    
    
      std
      19431.716674
      17642.117750
      1.108595
      1.558443
      71.836032
      5.238218
      5.741201
      758.983215
      8.311813
      14.357991
    
    
      min
      10280.000000
      9875.000000
      1.300000
      3.000000
      73.000000
      10.000000
      12.000000
      1850.000000
      89.000000
      143.000000
    
    
      25%
      20334.250000
      18866.000000
      2.375000
      NaN
      165.000000
      17.000000
      24.000000
      3104.000000
      103.000000
      178.000000
    
    
      50%
      27635.000000
      25294.500000
      3.000000
      NaN
      210.000000
      19.000000
      26.000000
      3474.500000
      107.000000
      187.000000
    
    
      75%
      39205.000000
      35710.250000
      3.900000
      NaN
      255.000000
      21.250000
      29.000000
      3977.750000
      112.000000
      194.000000
    
    
      max
      192465.000000
      173560.000000
      8.300000
      12.000000
      500.000000
      60.000000
      66.000000
      7190.000000
      144.000000
      238.000000

The describe method has a couple of options to select different columns of data (include=) and the percentiles to compute (percentiles=). In the example below, we are selecting all of the columns and changing the displayed percentiles to 30% and 70% (50% is always displayed).



In [27]:

    
df.describe(include='all', percentiles=[0.3, 0.7])









    



/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    Out[27]:






  
    
      
      Model
      Make
      Type
      Origin
      DriveTrain
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
  
  
    
      count
      428
      428
      428
      428
      428
      428.000000
      428.000000
      428.000000
      426.000000
      428.000000
      428.000000
      428.000000
      428.000000
      428.000000
      428.000000
    
    
      unique
      425
      38
      6
      3
      3
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      top
      G35 4dr
      Toyota
      Sedan
      Asia
      Front
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      freq
      2
      28
      262
      158
      226
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      mean
      NaN
      NaN
      NaN
      NaN
      NaN
      32774.855140
      30014.700935
      3.196729
      5.807512
      215.885514
      20.060748
      26.843458
      3577.953271
      108.154206
      186.362150
    
    
      std
      NaN
      NaN
      NaN
      NaN
      NaN
      19431.716674
      17642.117750
      1.108595
      1.558443
      71.836032
      5.238218
      5.741201
      758.983215
      8.311813
      14.357991
    
    
      min
      NaN
      NaN
      NaN
      NaN
      NaN
      10280.000000
      9875.000000
      1.300000
      3.000000
      73.000000
      10.000000
      12.000000
      1850.000000
      89.000000
      143.000000
    
    
      30%
      NaN
      NaN
      NaN
      NaN
      NaN
      22001.000000
      20288.100000
      2.400000
      NaN
      172.100000
      18.000000
      25.000000
      3217.000000
      104.000000
      178.000000
    
    
      50%
      NaN
      NaN
      NaN
      NaN
      NaN
      27635.000000
      25294.500000
      3.000000
      NaN
      210.000000
      19.000000
      26.000000
      3474.500000
      107.000000
      187.000000
    
    
      70%
      NaN
      NaN
      NaN
      NaN
      NaN
      35938.000000
      32987.500000
      3.600000
      NaN
      240.000000
      21.000000
      29.000000
      3835.300000
      111.000000
      192.000000
    
    
      max
      NaN
      NaN
      NaN
      NaN
      NaN
      192465.000000
      173560.000000
      8.300000
      12.000000
      500.000000
      60.000000
      66.000000
      7190.000000
      144.000000
      238.000000

You can also select individual statistics using their own methods.



In [28]:

    
df.min()









    Out[28]:





Model          3.5 RL 4dr
Make                Acura
Type               Hybrid
Origin               Asia
DriveTrain            All
MSRP                10280
Invoice              9875
EngineSize            1.3
Cylinders               3
Horsepower             73
MPG_City               10
MPG_Highway            12
Weight               1850
Wheelbase              89
Length                143
dtype: object



In [29]:

    
df.median()









    Out[29]:





MSRP           27635.0
Invoice        25294.5
EngineSize         3.0
Cylinders          6.0
Horsepower       210.0
MPG_City          19.0
MPG_Highway       26.0
Weight          3474.5
Wheelbase        107.0
Length           187.0
dtype: float64

All of the above methods also work on individual columns, or subsets of columns.



In [30]:

    
df['MSRP'].describe()









    Out[30]:





count       428.000000
mean      32774.855140
std       19431.716674
min       10280.000000
25%       20334.250000
50%       27635.000000
75%       39205.000000
max      192465.000000
Name: MSRP, dtype: float64



In [31]:

    
df['MSRP'].min()









    Out[31]:





10280.0



In [32]:

    
df[['MSRP', 'Horsepower']].describe()









    Out[32]:






  
    
      
      MSRP
      Horsepower
    
  
  
    
      count
      428.000000
      428.000000
    
    
      mean
      32774.855140
      215.885514
    
    
      std
      19431.716674
      71.836032
    
    
      min
      10280.000000
      73.000000
    
    
      25%
      20334.250000
      165.000000
    
    
      50%
      27635.000000
      210.000000
    
    
      75%
      39205.000000
      255.000000
    
    
      max
      192465.000000
      500.000000

Grouping Data

Another common operation in analyzing data is grouping by variable values. This is primarily done using the groupby method of DataFrames.



In [33]:

    
grpdf = df.groupby('Origin')
grpdf









    Out[33]:





<pandas.core.groupby.DataFrameGroupBy object at 0x7faed59673c8>

You'll notice that in this case the returned value is a DataFrameGroupBy object. Many of the methods available on a DataFrame will also work on the DataFrameGroupBy object.



In [34]:

    
grpdf.describe()









    



/opt/anaconda3/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    Out[34]:






  
    
      
      
      MSRP
      Invoice
      EngineSize
      Cylinders
      Horsepower
      MPG_City
      MPG_Highway
      Weight
      Wheelbase
      Length
    
    
      Origin
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Asia
      count
      158.000000
      158.000000
      158.000000
      156.000000
      158.000000
      158.000000
      158.000000
      158.000000
      158.000000
      158.000000
    
    
      mean
      24741.322785
      22602.177215
      2.774051
      5.185897
      190.702532
      22.012658
      28.265823
      3319.316456
      105.886076
      182.816456
    
    
      std
      11321.069675
      9842.984880
      0.902310
      1.269008
      59.392627
      6.733307
      6.770503
      717.842132
      7.735249
      12.564148
    
    
      min
      10280.000000
      9875.000000
      1.300000
      3.000000
      73.000000
      13.000000
      17.000000
      1850.000000
      89.000000
      153.000000
    
    
      25%
      17208.000000
      16271.500000
      2.000000
      NaN
      142.000000
      18.000000
      25.000000
      2751.500000
      102.000000
      175.000000
    
    
      50%
      23032.500000
      20949.500000
      2.600000
      NaN
      187.500000
      20.500000
      27.000000
      3280.000000
      105.000000
      183.500000
    
    
      75%
      28787.500000
      26645.000000
      3.500000
      NaN
      233.750000
      24.000000
      31.000000
      3706.750000
      110.000000
      191.000000
    
    
      max
      89765.000000
      79978.000000
      5.600000
      8.000000
      340.000000
      60.000000
      66.000000
      5590.000000
      140.000000
      224.000000
    
    
      Europe
      count
      123.000000
      123.000000
      123.000000
      123.000000
      123.000000
      123.000000
      123.000000
      123.000000
      123.000000
      123.000000
    
    
      mean
      48349.796748
      44395.081301
      3.206504
      6.235772
      251.894309
      18.731707
      26.008130
      3680.723577
      106.447154
      181.845528
    
    
      std
      25318.600464
      23080.368888
      1.037157
      1.679585
      80.738884
      3.289509
      4.167588
      579.576871
      6.690099
      11.479870
    
    
      min
      16999.000000
      15437.000000
      1.600000
      4.000000
      100.000000
      12.000000
      14.000000
      2524.000000
      93.000000
      143.000000
    
    
      25%
      33837.500000
      31287.500000
      2.500000
      5.000000
      195.500000
      17.000000
      24.000000
      3285.000000
      102.000000
      177.000000
    
    
      50%
      40590.000000
      37575.000000
      3.000000
      6.000000
      225.000000
      19.000000
      26.000000
      3585.000000
      107.000000
      182.000000
    
    
      75%
      56382.500000
      51042.500000
      4.200000
      8.000000
      302.000000
      20.000000
      28.500000
      3959.500000
      110.000000
      190.000000
    
    
      max
      192465.000000
      173560.000000
      6.000000
      12.000000
      493.000000
      38.000000
      46.000000
      5423.000000
      123.000000
      204.000000
    
    
      USA
      count
      147.000000
      147.000000
      147.000000
      147.000000
      147.000000
      147.000000
      147.000000
      147.000000
      147.000000
      147.000000
    
    
      mean
      28377.442177
      25949.340136
      3.642857
      6.108844
      212.823129
      19.074830
      26.013605
      3769.952381
      112.020408
      193.952381
    
    
      std
      11711.982506
      10518.722194
      1.194221
      1.531090
      63.748618
      3.982992
      5.396582
      855.305524
      8.788590
      15.305265
    
    
      min
      10995.000000
      10319.000000
      1.600000
      4.000000
      103.000000
      10.000000
      12.000000
      2348.000000
      93.000000
      150.000000
    
    
      25%
      20315.000000
      18927.000000
      2.700000
      5.000000
      160.000000
      17.000000
      22.000000
      3234.000000
      106.000000
      185.000000
    
    
      50%
      25520.000000
      23217.000000
      3.600000
      6.000000
      200.000000
      18.000000
      26.000000
      3606.000000
      111.000000
      194.000000
    
    
      75%
      33767.500000
      30778.000000
      4.600000
      8.000000
      250.000000
      21.000000
      29.000000
      4235.000000
      115.000000
      201.500000
    
    
      max
      81795.000000
      74451.000000
      8.300000
      10.000000
      500.000000
      29.000000
      37.000000
      7190.000000
      144.000000
      238.000000



In [35]:

    
grpdf[['MSRP', 'Horsepower']].describe(percentiles=[0.2, 0.8])









    Out[35]:






  
    
      
      
      MSRP
      Horsepower
    
    
      Origin
      
      
      
    
  
  
    
      Asia
      count
      158.000000
      158.000000
    
    
      mean
      24741.322785
      190.702532
    
    
      std
      11321.069675
      59.392627
    
    
      min
      10280.000000
      73.000000
    
    
      20%
      15680.800000
      130.000000
    
    
      50%
      23032.500000
      187.500000
    
    
      80%
      30995.000000
      240.000000
    
    
      max
      89765.000000
      340.000000
    
    
      Europe
      count
      123.000000
      123.000000
    
    
      mean
      48349.796748
      251.894309
    
    
      std
      25318.600464
      80.738884
    
    
      min
      16999.000000
      100.000000
    
    
      20%
      32506.000000
      184.000000
    
    
      50%
      40590.000000
      225.000000
    
    
      80%
      64248.000000
      315.000000
    
    
      max
      192465.000000
      493.000000
    
    
      USA
      count
      147.000000
      147.000000
    
    
      mean
      28377.442177
      212.823129
    
    
      std
      11711.982506
      63.748618
    
    
      min
      10995.000000
      103.000000
    
    
      20%
      19099.000000
      150.000000
    
    
      50%
      25520.000000
      200.000000
    
    
      80%
      36079.000000
      275.000000
    
    
      max
      81795.000000
      500.000000

It is also possible to loop through all of the groupings using iteration in Python.



In [36]:

    
for name, group in grpdf:
    print('=' * 35, name, '=' * 35)
    print('')
    print(group.head())
    print('')









    



=================================== Asia ===================================

            Model   Make   Type Origin DriveTrain     MSRP  Invoice  \
0             MDX  Acura    SUV   Asia        All  36945.0  33337.0   
1  RSX Type S 2dr  Acura  Sedan   Asia      Front  23820.0  21761.0   
2         TSX 4dr  Acura  Sedan   Asia      Front  26990.0  24647.0   
3          TL 4dr  Acura  Sedan   Asia      Front  33195.0  30299.0   
4      3.5 RL 4dr  Acura  Sedan   Asia      Front  43755.0  39014.0   

   EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  \
0         3.5        6.0       265.0      17.0         23.0  4451.0   
1         2.0        4.0       200.0      24.0         31.0  2778.0   
2         2.4        4.0       200.0      22.0         29.0  3230.0   
3         3.2        6.0       270.0      20.0         28.0  3575.0   
4         3.5        6.0       225.0      18.0         24.0  3880.0   

   Wheelbase  Length  
0      106.0   189.0  
1      101.0   172.0  
2      105.0   183.0  
3      108.0   186.0  
4      115.0   197.0  

=================================== Europe ===================================

                        Model  Make   Type  Origin DriveTrain     MSRP  \
7                 A4 1.8T 4dr  Audi  Sedan  Europe      Front  25940.0   
8      A41.8T convertible 2dr  Audi  Sedan  Europe      Front  35940.0   
9                  A4 3.0 4dr  Audi  Sedan  Europe      Front  31840.0   
10  A4 3.0 Quattro 4dr manual  Audi  Sedan  Europe        All  33430.0   
11    A4 3.0 Quattro 4dr auto  Audi  Sedan  Europe        All  34480.0   

    Invoice  EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  \
7   23508.0         1.8        4.0       170.0      22.0         31.0  3252.0   
8   32506.0         1.8        4.0       170.0      23.0         30.0  3638.0   
9   28846.0         3.0        6.0       220.0      20.0         28.0  3462.0   
10  30366.0         3.0        6.0       220.0      17.0         26.0  3583.0   
11  31388.0         3.0        6.0       220.0      18.0         25.0  3627.0   

    Wheelbase  Length  
7       104.0   179.0  
8       105.0   180.0  
9       104.0   179.0  
10      104.0   179.0  
11      104.0   179.0  

=================================== USA ===================================

                 Model   Make   Type Origin DriveTrain     MSRP  Invoice  \
46             Rainier  Buick    SUV    USA        All  37895.0  34357.0   
47       Rendezvous CX  Buick    SUV    USA      Front  26545.0  24085.0   
48  Century Custom 4dr  Buick  Sedan    USA      Front  22180.0  20351.0   
49  LeSabre Custom 4dr  Buick  Sedan    USA      Front  26470.0  24282.0   
50        Regal LS 4dr  Buick  Sedan    USA      Front  24895.0  22835.0   

    EngineSize  Cylinders  Horsepower  MPG_City  MPG_Highway  Weight  \
46         4.2        6.0       275.0      15.0         21.0  4600.0   
47         3.4        6.0       185.0      19.0         26.0  4024.0   
48         3.1        6.0       175.0      20.0         30.0  3353.0   
49         3.8        6.0       205.0      20.0         29.0  3567.0   
50         3.8        6.0       200.0      20.0         30.0  3461.0   

    Wheelbase  Length  
46      113.0   193.0  
47      112.0   187.0  
48      109.0   195.0  
49      112.0   200.0  
50      109.0   196.0

There are far too many grouping features in Pandas to cover here. For more information, see the Pandas Group By documentation.

Plotting

There are several packages for creating plots in Python. These include matplotlib, seaborn, bokeh, plot.ly, or even Pandas itself. Many of these packages such as seaborn and the Pandas plotting features use matplotlib in the background. Packages like bokeh and plot.ly are primarily focused on graphics that are rendered in a web browser.

The most basic plotting features can be accessed in the plot method of the DataFrame. To start, we can create a scatter plot of the MSRP values.



In [37]:

    
%matplotlib inline



In [38]:

    
df.plot(kind='scatter', x='MSRP', y='Horsepower', figsize=(12,6))









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x7faed529c748>

Using the cufflinks package, which is a DataFrame-style interface to plot.ly, we can create a graph that is rendered in the browser and includes interactive features.



In [39]:

    
import cufflinks

# Configure cufflinks to render the graph locally.
cufflinks.go_offline()

df.iplot(kind='scatter', x='MSRP', y='Horsepower', mode='markers', 
         size=8, dimensions=(750, 400))

For more information on plotting features of DataFrames, see the Pandas Visualization documentation.

Conclusion

We've just covered the very basics of the Pandas package here. You should have enough to get started, but for more information, you should see the official documentation.

	Make	Model	Type	Origin	DriveTrain	MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
0	Acura	MDX	SUV	Asia	All	36945.0	33337.0	3.5	6.0	265.0	17.0	23.0	4451.0	106.0	189.0
1	Acura	RSX Type S 2dr	Sedan	Asia	Front	23820.0	21761.0	2.0	4.0	200.0	24.0	31.0	2778.0	101.0	172.0
2	Acura	TSX 4dr	Sedan	Asia	Front	26990.0	24647.0	2.4	4.0	200.0	22.0	29.0	3230.0	105.0	183.0
3	Acura	TL 4dr	Sedan	Asia	Front	33195.0	30299.0	3.2	6.0	270.0	20.0	28.0	3575.0	108.0	186.0
4	Acura	3.5 RL 4dr	Sedan	Asia	Front	43755.0	39014.0	3.5	6.0	225.0	18.0	24.0	3880.0	115.0	197.0

	Make	Type	Origin	DriveTrain	MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
Model
MDX	Acura	SUV	Asia	All	36945.0	33337.0	3.5	6.0	265.0	17.0	23.0	4451.0	106.0	189.0
RSX Type S 2dr	Acura	Sedan	Asia	Front	23820.0	21761.0	2.0	4.0	200.0	24.0	31.0	2778.0	101.0	172.0
TSX 4dr	Acura	Sedan	Asia	Front	26990.0	24647.0	2.4	4.0	200.0	22.0	29.0	3230.0	105.0	183.0
TL 4dr	Acura	Sedan	Asia	Front	33195.0	30299.0	3.2	6.0	270.0	20.0	28.0	3575.0	108.0	186.0
3.5 RL 4dr	Acura	Sedan	Asia	Front	43755.0	39014.0	3.5	6.0	225.0	18.0	24.0	3880.0	115.0	197.0

	Make	Type	Origin	DriveTrain	MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
Model
Viper SRT-10 convertible 2dr	Dodge	Sports	USA	Rear	81795.0	74451.0	8.3	10.0	500.0	12.0	20.0	3410.0	99.0	176.0
Excursion 6.8 XLT	Ford	SUV	USA	All	41475.0	36494.0	6.8	10.0	310.0	10.0	13.0	7190.0	137.0	227.0
CL600 2dr	Mercedes-Benz	Sedan	Europe	Rear	128420.0	119600.0	5.5	12.0	493.0	13.0	19.0	4473.0	114.0	196.0
SL600 convertible 2dr	Mercedes-Benz	Sports	Europe	Rear	126670.0	117854.0	5.5	12.0	493.0	13.0	19.0	4429.0	101.0	179.0
Phaeton W12 4dr	Volkswagen	Sedan	Europe	Front	75000.0	69130.0	6.0	12.0	420.0	12.0	19.0	5399.0	118.0	204.0

	Make	Type	Origin	DriveTrain	MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
Model
Rio 4dr manual	Kia	Sedan	Asia	Front	10280.0	9875.0	1.6	4.0	104.0	26.0	33.0	2403.0	95.0	167.0
Accent 2dr hatch	Hyundai	Sedan	Asia	Front	10539.0	10107.0	1.6	4.0	103.0	29.0	33.0	2255.0	96.0	167.0
Echo 2dr manual	Toyota	Sedan	Asia	Front	10760.0	10144.0	1.5	4.0	108.0	35.0	43.0	2035.0	93.0	163.0
Ion1 4dr	Saturn	Sedan	USA	Front	10995.0	10319.0	2.2	4.0	140.0	26.0	35.0	2692.0	103.0	185.0
Rio 4dr auto	Kia	Sedan	Asia	Front	11155.0	10705.0	1.6	4.0	104.0	25.0	32.0	2458.0	95.0	167.0

	Make	Type	Origin	DriveTrain	MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
Model
911 GT2 2dr	Porsche	Sports	Europe	Rear	192465.0	173560.0	3.6	6.0	477.0	17.0	24.0	3131.0	93.0	175.0
CL600 2dr	Mercedes-Benz	Sedan	Europe	Rear	128420.0	119600.0	5.5	12.0	493.0	13.0	19.0	4473.0	114.0	196.0
SL600 convertible 2dr	Mercedes-Benz	Sports	Europe	Rear	126670.0	117854.0	5.5	12.0	493.0	13.0	19.0	4429.0	101.0	179.0
SL55 AMG 2dr	Mercedes-Benz	Sports	Europe	Rear	121770.0	113388.0	5.5	8.0	493.0	14.0	21.0	4235.0	101.0	179.0
CL500 2dr	Mercedes-Benz	Sedan	Europe	Rear	94820.0	88324.0	5.0	8.0	302.0	16.0	24.0	4085.0	114.0	196.0

	MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
count	428.000000	428.000000	428.000000	426.000000	428.000000	428.000000	428.000000	428.000000	428.000000	428.000000
mean	32774.855140	30014.700935	3.196729	5.807512	215.885514	20.060748	26.843458	3577.953271	108.154206	186.362150
std	19431.716674	17642.117750	1.108595	1.558443	71.836032	5.238218	5.741201	758.983215	8.311813	14.357991
min	10280.000000	9875.000000	1.300000	3.000000	73.000000	10.000000	12.000000	1850.000000	89.000000	143.000000
25%	20334.250000	18866.000000	2.375000	NaN	165.000000	17.000000	24.000000	3104.000000	103.000000	178.000000
50%	27635.000000	25294.500000	3.000000	NaN	210.000000	19.000000	26.000000	3474.500000	107.000000	187.000000
75%	39205.000000	35710.250000	3.900000	NaN	255.000000	21.250000	29.000000	3977.750000	112.000000	194.000000
max	192465.000000	173560.000000	8.300000	12.000000	500.000000	60.000000	66.000000	7190.000000	144.000000	238.000000

		MSRP	Invoice	EngineSize	Cylinders	Horsepower	MPG_City	MPG_Highway	Weight	Wheelbase	Length
Origin
Asia	count	158.000000	158.000000	158.000000	156.000000	158.000000	158.000000	158.000000	158.000000	158.000000	158.000000
	mean	24741.322785	22602.177215	2.774051	5.185897	190.702532	22.012658	28.265823	3319.316456	105.886076	182.816456
	std	11321.069675	9842.984880	0.902310	1.269008	59.392627	6.733307	6.770503	717.842132	7.735249	12.564148
	min	10280.000000	9875.000000	1.300000	3.000000	73.000000	13.000000	17.000000	1850.000000	89.000000	153.000000
	25%	17208.000000	16271.500000	2.000000	NaN	142.000000	18.000000	25.000000	2751.500000	102.000000	175.000000
	50%	23032.500000	20949.500000	2.600000	NaN	187.500000	20.500000	27.000000	3280.000000	105.000000	183.500000
	75%	28787.500000	26645.000000	3.500000	NaN	233.750000	24.000000	31.000000	3706.750000	110.000000	191.000000
	max	89765.000000	79978.000000	5.600000	8.000000	340.000000	60.000000	66.000000	5590.000000	140.000000	224.000000
Europe	count	123.000000	123.000000	123.000000	123.000000	123.000000	123.000000	123.000000	123.000000	123.000000	123.000000
	mean	48349.796748	44395.081301	3.206504	6.235772	251.894309	18.731707	26.008130	3680.723577	106.447154	181.845528
	std	25318.600464	23080.368888	1.037157	1.679585	80.738884	3.289509	4.167588	579.576871	6.690099	11.479870
	min	16999.000000	15437.000000	1.600000	4.000000	100.000000	12.000000	14.000000	2524.000000	93.000000	143.000000
	25%	33837.500000	31287.500000	2.500000	5.000000	195.500000	17.000000	24.000000	3285.000000	102.000000	177.000000
	50%	40590.000000	37575.000000	3.000000	6.000000	225.000000	19.000000	26.000000	3585.000000	107.000000	182.000000
	75%	56382.500000	51042.500000	4.200000	8.000000	302.000000	20.000000	28.500000	3959.500000	110.000000	190.000000
	max	192465.000000	173560.000000	6.000000	12.000000	493.000000	38.000000	46.000000	5423.000000	123.000000	204.000000
USA	count	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000
	mean	28377.442177	25949.340136	3.642857	6.108844	212.823129	19.074830	26.013605	3769.952381	112.020408	193.952381
	std	11711.982506	10518.722194	1.194221	1.531090	63.748618	3.982992	5.396582	855.305524	8.788590	15.305265
	min	10995.000000	10319.000000	1.600000	4.000000	103.000000	10.000000	12.000000	2348.000000	93.000000	150.000000
	25%	20315.000000	18927.000000	2.700000	5.000000	160.000000	17.000000	22.000000	3234.000000	106.000000	185.000000
	50%	25520.000000	23217.000000	3.600000	6.000000	200.000000	18.000000	26.000000	3606.000000	111.000000	194.000000
	75%	33767.500000	30778.000000	4.600000	8.000000	250.000000	21.000000	29.000000	4235.000000	115.000000	201.500000
	max	81795.000000	74451.000000	8.300000	10.000000	500.000000	29.000000	37.000000	7190.000000	144.000000	238.000000